Contributions to this project were provided by:
In this notebook, network science concepts are used to identify immigration and emigration patterns and their plausible causal factors. International migration has increased far more than in previous years due to ease of travel, improved modes of communication, and better air/road traffic networks. The World Bank database has been used to study the factors that show an impact on migration between the years 1990-2009. This study is a prototype that can be applied to larger migration datasets.
These are the concepts that this project highlights:
Using the World Bank data and network science concepts, perform the following analysis:
● Identify indices among the network of people migrations (nodes and edges)
● Identify patterns of people migrating from one country to another (using links between nodes, where one country is the source and another the destination)
● Find the centralities among the network of countries
● Determine an optimum modularity and generate communities
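The activities above can be sketched on a toy network with networkx. The country codes and flow counts below are illustrative only, not World Bank figures:

```python
import networkx as nx
from networkx.algorithms import community

# Toy directed migration network: (origin, destination, migrant count).
flows = [("MEX", "USA", 500), ("IND", "USA", 300), ("POL", "GBR", 200),
         ("POL", "DEU", 150), ("IND", "GBR", 100), ("GBR", "USA", 80)]

G = nx.DiGraph()
G.add_weighted_edges_from(flows)

# Centralities among the network of countries: GBR sits on the
# POL -> GBR -> USA path, so it is the only non-zero scorer here.
bc = nx.betweenness_centrality(G)

# Communities chosen to (approximately) maximize modularity.
comms = community.greedy_modularity_communities(G.to_undirected(), weight="weight")

print(bc)
print([sorted(c) for c in comms])
```

The same pattern (origin/destination codes as nodes, migrant counts as edge weights) is applied to the real World Bank extract in the cells that follow.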
import csv
from operator import itemgetter
import networkx as nx
from networkx.algorithms import community
import pandas as pd
df = pd.read_excel (r'Data_Extract_From_Global_Bilateral_Migration2000.xlsx', sheet_name='Sheet1')
nodes_list = df['Country Origin Code'].unique()
df_adjList = list(zip(df['Country Origin Code'].map(str), df['Country Dest Code'].map(str),df['Total [TOT]'].map(str)))
df.head()
#generating the digraph
import networkx as nx
G = nx.DiGraph()
G.add_nodes_from(nodes_list)
G.add_weighted_edges_from(df_adjList)
nx.write_gexf(G, "digraphProjectweighted.gexf")
#Removing spaces from the column Names
dfrename=df
dfrename.columns = dfrename.columns.str.strip().str.lower().str.replace(' ', '', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)
dfrename.head()
Identified the top 10 countries within the 4 communities where immigration is relatively high. Visibly, the USA has a notable migrant population, followed by the United Kingdom.
Identified the top 10 countries within the 6 communities where emigration is relatively high. Mexico and India seem to be the countries that have larger populations moving elsewhere.
Observed that betweenness centrality is the most significant measure among all top 10 countries identified for immigration and emigration, indicating that Poland and France had higher significance in the year 2000.
Betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality of each vertex is the number of these shortest paths that pass through the vertex. Reference: https://en.wikipedia.org/wiki/Betweenness_centrality
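For instance, in a three-node directed path, every shortest path between the end nodes passes through the middle node, which therefore gets the only non-zero score:

```python
import networkx as nx

# B is the only intermediary on the sole A -> C shortest path,
# so it receives the only non-zero betweenness score.
G = nx.DiGraph([("A", "B"), ("B", "C")])
bc = nx.betweenness_centrality(G)
print(bc)  # {'A': 0.0, 'B': 0.5, 'C': 0.0}
```

(For a directed graph, networkx normalizes by 1/((n-1)(n-2)), hence B's score of 1/2.)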
The data cleansing steps to generate the graphs to prove out the centralities are as follows:
#Filter the top 10 as shown in the directed weighted graph by observing the weighted in-degree and weighted out-degree
dfemifilter = pd.read_csv (r'project_migrationgexftableweight.csv')
dfemifiltertop10=dfemifilter.sort_values(by='weighted outdegree', ascending=False)
dfemifiltertop10=dfemifiltertop10.nlargest(10, 'weighted outdegree')
dfemifiltertop10=dfemifiltertop10[['Label']]
dfemifiltertop10
Above are the top 10 countries for emigration identified from the Network plotted.
dfemifiltertopanalysis=dfrename[dfrename.countryorigincode.isin(dfemifiltertop10.Label)]
dfemifiltertopanalysis.head()
#Filter the top 10 as shown in the directed weighted graph by observing the weighted in-degree and weighted out-degree
dfemifiltertopanalysisnodes_list = dfemifiltertopanalysis['countryorigincode'].unique()
# Weights must come from the filtered frame (not the full df) so rows stay aligned
dfemifiltertopanalysis_adjList = list(zip(dfemifiltertopanalysis['countryorigincode'].map(str), dfemifiltertopanalysis['countrydestcode'].map(str), dfemifiltertopanalysis['total[tot]'].map(str)))
Gemitopanalysis = nx.DiGraph()
Gemitopanalysis.add_nodes_from(dfemifiltertopanalysisnodes_list)
Gemitopanalysis.add_weighted_edges_from(dfemifiltertopanalysis_adjList)
nx.write_gexf(Gemitopanalysis, "digraphGemitopanalysis.gexf")
Centrality-based network for emigration
As indicated above, Poland and Italy have significant betweenness centralities.
dfimmifilter = pd.read_csv (r'project_migrationgexftableweight.csv')
dfimmifiltertop10=dfimmifilter.sort_values(by='weighted indegree', ascending=False)
dfimmifiltertop10=dfimmifiltertop10.nlargest(10, 'weighted indegree')
dfimmifiltertop10=dfimmifiltertop10[['Label']]
dfimmifiltertop10
Above are the top 10 countries for immigration identified from the network plotted.
dfimmifiltertopanalysis=dfrename[dfrename.countrydestcode.isin(dfimmifiltertop10.Label)]
dfimmifiltertopanalysis.head()
dfimmifiltertopanalysisnodes_list = dfimmifiltertopanalysis['countryorigincode'].unique()
# Weights must come from the filtered frame (not the full df) so rows stay aligned
dfimmifiltertopanalysis_adjList = list(zip(dfimmifiltertopanalysis['countryorigincode'].map(str), dfimmifiltertopanalysis['countrydestcode'].map(str), dfimmifiltertopanalysis['total[tot]'].map(str)))
Gimmitopanalysis = nx.DiGraph()
Gimmitopanalysis.add_nodes_from(dfimmifiltertopanalysisnodes_list)
Gimmitopanalysis.add_weighted_edges_from(dfimmifiltertopanalysis_adjList)
nx.write_gexf(Gimmitopanalysis, "digraphGimmitopanalysis.gexf")
Centrality-based network for immigration. The most significant countries are France and India.
Data from the World Bank from 1990-2009 has been merged to consider all possible factors for international migration.
The idea is that migration, be it immigration or emigration, caused by correlated factors would have a positive impact on one country and the inverse on the other.
For example, if the annual education rate in a destination country has increased due to immigration, the education rate in the source country would be negatively impacted, leading to brain drain. Hence a combination of all possible factors was considered for the dimensionality reduction.
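A minimal sketch of the correlation-based pruning applied later in this notebook, using made-up indicator columns (the names and values are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical indicator columns; edu_rate_dest is (by construction)
# almost perfectly correlated with gdp_dest, while rainfall is not.
df = pd.DataFrame({
    "gdp_dest": [1.0, 2.0, 3.0, 4.0, 5.0],
    "edu_rate_dest": [1.1, 2.0, 3.1, 4.0, 5.1],
    "rainfall": [7.0, 3.0, 9.0, 1.0, 5.0],
})

corr = df.corr().abs()
keep = np.full(corr.shape[0], True)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.9 and keep[j]:
            keep[j] = False  # drop the later of two collinear columns

reduced = df[df.columns[keep]]
print(list(reduced.columns))  # ['gdp_dest', 'rainfall']
```

Only one member of each highly correlated pair survives, which is the same idea used below to cut the World Bank series down to a workable set of factors.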
# Below: 1. We combine the entire dataset from the World Bank, which contains around 4000+ series codes that
# contribute to factors, and filter by the top countries for immigration and emigration identified in Gephi.
# 2. Get the clean, unique series codes that appear in all the datasets (emigration and immigration).
# We benchmark a minimum count of 18, which gives us the number of series that can be used to find the correlation.
# This number is based on the number of years covered by the dataset.
import pandas as pd
import numpy as np
train_df = pd.read_csv('NEWDATA.csv')
eccf=['MEX','IND','CHN','POL','BGD','PAK','DEU','KAZ','ITA','PHL']
train_df_emifilter=train_df[train_df.CountryCode.isin(eccf)]
train_df_emifilter = train_df_emifilter.replace('..',np.nan)
train_df_emifilter.dropna(axis=0, how='any', thresh=10, subset=None, inplace=True)
iccf=['USA','RUS','DEU','IND','FRA','GBR','SAU','CAN','PAK','HKG']
train_df_imifilter=train_df[train_df.CountryCode.isin(iccf)]
train_df_imifilter = train_df_imifilter.replace('..',np.nan)
train_df_imifilter.dropna(axis=0, how='any', thresh=10, subset=None, inplace=True)
# Cleaning and merging the 2 datasets (emigration top 10 and immigration top 10)
train_df_concateseries=pd.concat([train_df_emifilter, train_df_imifilter])
train_df_series = train_df_concateseries.filter(['SeriesCode'], axis=1)
train_df_series = train_df_series[train_df_series.duplicated()]
train_df_series=train_df_series['SeriesCode'].value_counts()
train_df_series=train_df_series.to_frame()
train_df_series.rename(columns={'SeriesCode': 'Counts'}, inplace = True)
train_df_series['Series'] = train_df_series.index
train_df_series['ID'] = np.arange(len(train_df_series))
train_df_seriess=train_df_series.set_index('ID')
train_df_seriess=train_df_seriess[train_df_seriess['Counts'] > 18]
train_df_seriess.head()
# Merging the dataset with emigration countries and series codes that have been identified to have enough data and are common for all top 10 countries
merged_inner_emifilter = pd.merge(left=train_df_emifilter, right=train_df_seriess, left_on='SeriesCode', right_on='Series', how='inner')
merged_inner_emifilter.head()
# Merging the dataset with immigration countries and series codes that have been identified to have enough data and are common for all top 10 countries
merged_inner_imifilter = pd.merge(left=train_df_imifilter, right=train_df_seriess, left_on='SeriesCode', right_on='Series', how='inner')
merged_inner_imifilter.head()
len(eccf)
def transpose_per_ctry(merged_inner_emifilter, eccf, Cat):
    train_dfall = pd.DataFrame()
    for i in range(len(eccf)):
        # Select one country, drop label columns, and transpose so years become rows
        train_df_emi0 = merged_inner_emifilter[merged_inner_emifilter['CountryCode'] == eccf[i]]
        train_df_emi0 = train_df_emi0.drop(['CountryName', 'SeriesName', 'CountryCode', 'Series', 'Counts'], axis=1)
        train_df_emi0 = train_df_emi0.transpose()
        train_df_emi0.columns = train_df_emi0.iloc[0]
        train_df_emi0 = train_df_emi0[1:]
        train_df_emi0 = train_df_emi0.astype('float64')
        train_df_emi0['Category'] = Cat
        train_dfall = pd.concat([train_df_emi0, train_dfall])
    return train_dfall
train_dfallemi=transpose_per_ctry(merged_inner_emifilter,eccf,'EMI')
train_dfallimi=transpose_per_ctry(merged_inner_imifilter,iccf,'IMI')
train_df_full=pd.concat([train_dfallemi,train_dfallimi])
train_df_full.reset_index(inplace=True)
train_df_full.rename(columns={'index':'Year'},inplace=True)
train_df_full.head()
Category = train_df_full['Category']
train_df_full.drop(labels=['Category'], axis=1,inplace = True)
train_df_full.insert(0, 'Category', Category)
train_df_full.describe()
cordfhm = train_df_full.copy()
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
label_encoder = LabelEncoder()
cordfhm.iloc[:,0] = label_encoder.fit_transform(cordfhm.iloc[:,0]).astype('float64')
cordfhm.head()
corr = cordfhm.astype('float64').corr().abs()
corr.head()
plt.figure(figsize=(15, 15))
sns.heatmap(corr,cmap='BuGn')
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = train_df_full.columns[columns]
train_df_corr = train_df_full[selected_columns]
train_df_corr.head()
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', None)
train_dfsrslist=train_df_corr.columns.to_frame()
train_dfsrslist.head()
train_dfsrs=train_df_concateseries[['SeriesName','SeriesCode']]
train_dfsrs=train_dfsrs.drop_duplicates()
train_dfsrslist=train_df_corr.columns.to_frame()
train_dfsrslist.rename(columns={'SeriesCode':'List'},inplace=True)
train_dfsrslist.reset_index()
train_dfsrslist=train_dfsrslist.drop(columns ='List')
train_dfsrslist=train_dfsrslist.iloc[2:]
train_dfsrslist
corrcols = pd.merge(left=train_dfsrs, right=train_dfsrslist, left_on='SeriesCode', right_on='SeriesCode', how='inner')
corrcols.drop_duplicates(inplace=True)
corrcols.head()
The feature selector library used below was created by Will Koehrsen (https://towardsdatascience.com/@williamkoehrsen) and is described at https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0
train_fs=train_df_corr.copy()
train_fs.to_csv('corr_results.csv')
# Creating train and train-labels which are required in the feature selector algorithm that is to be used below:
train_labels = train_fs['Category']
train = train_fs.drop(columns = ['Category'])
train.head()
train_fs = train.copy()
train_fs.to_csv('train_fs_results.csv', index=False)
train_fs_df = pd.read_csv('train_fs_results.csv')
train_fs_df.head()
from feature_selector import FeatureSelector
fs = FeatureSelector(train_fs_df,train_labels)
fs.identify_missing(0.9)
fs.plot_missing()
fs.identify_single_unique()
fs.plot_unique()
fs.identify_collinear(0.85)
fs.plot_collinear()
fs.record_collinear.head()
fs_coll_col = fs.record_collinear
fs.identify_zero_importance(task = 'classification', eval_metric = 'auc',
n_iterations = 10, early_stopping = True)
fs.plot_feature_importances(threshold = 0.85)
train_df_pca=train_fs_df.copy()
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
train_df_pca.dropna(inplace=True)
array = train_df_pca.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=5)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
print(pca)
data_vizimi = pd.merge(left=merged_inner_imifilter, right=fs_coll_col, left_on='SeriesCode', right_on='corr_feature', how='inner')
data_vizimi.drop(columns=['Series','drop_feature','corr_feature'],inplace=True)
data_vizemi = pd.merge(left=merged_inner_emifilter, right=fs_coll_col, left_on='SeriesCode', right_on='corr_feature', how='inner')
data_vizemi.drop(columns=['Series','drop_feature','corr_feature'],inplace=True)
data_vizimi['category']='IMI'
data_vizemi['category']='EMI'
data_vizmi=pd.concat([data_vizimi,data_vizemi])
#data_vizmi.to_csv('mipre0.csv')
data_vizmi.head()
data_vizmip=data_vizmi.melt(id_vars =['CountryName','CountryCode','SeriesName','SeriesCode','category'], value_vars =['1995', '1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009'],
var_name ='Year', value_name ='Values')
data_vizmip.to_csv('mipre.csv')
data_vizmip=read_csv('mipre.csv')
data_vizmip
data_vizmip['RN'] = data_vizmip.groupby(['SeriesCode','Year'])['Values'].rank(method='first',ascending=False)
data_vizmip['RNRev'] = data_vizmip.groupby(['SeriesCode','Year'])['Values'].rank(method='first',ascending=True)
data_vizmip
data_vizmip.sort_values(by=['SeriesCode', 'Year','Values','RN'])
data_vizmip.to_csv('mi.csv')
from numpy.random import shuffle
import networkx as nx
import scipy.stats as stats
G = nx.DiGraph()
G.add_nodes_from(nodes_list)
G.add_weighted_edges_from(df_adjList)
nx.write_gml(G, "digraphProjectweighted.gml")
migrationgml = nx.read_gml("digraphProjectweighted.gml")
migration = migrationgml
migration = nx.DiGraph(migration)
print(type(migration).__name__)
migration_assortativity = nx.degree_assortativity_coefficient(migration)
migration_transitivity = nx.transitivity(migration)
print(migration_assortativity)
print(migration_transitivity)
# Function to compute the measures using the configuration model
degree_sequence = list(dict(nx.degree(migration)).values())
transitivity = []
assortativity = []
def model_metrics(graph, n):
    for i in range(n):
        # Build a configuration-model null graph with the same degree sequence,
        # collapse multi-edges, and drop self-loops
        null_graph = nx.configuration_model(degree_sequence)
        null_graph = nx.Graph(null_graph)
        null_graph.remove_edges_from(nx.selfloop_edges(null_graph))
        transitivity.append(nx.transitivity(null_graph))
        assortativity.append(nx.degree_assortativity_coefficient(null_graph))
    return transitivity, assortativity
migration_mod_transitivity, migration_mod_assortativity = model_metrics(migration,1000)
p_zscores_assortativity = stats.zscore( [migration_assortativity]+ migration_mod_assortativity)
p_zscores_transitivity = stats.zscore([migration_transitivity] + migration_mod_transitivity)
# Just print out the first score which corresponds to the real network
print(p_zscores_assortativity[0])
print(p_zscores_transitivity[0])
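The z-scores above can be turned into the p-value referred to in the conclusion, assuming (as a simplification) that the configuration-model distribution is approximately normal:

```python
from scipy import stats

# Convert a z-score (real network relative to the configuration-model
# ensemble) into a two-sided p-value under a normal approximation.
def z_to_pvalue(z):
    return 2 * stats.norm.sf(abs(z))

print(z_to_pvalue(0.5))  # ~0.617: weak deviation, large p-value
print(z_to_pvalue(3.0))  # ~0.0027: strong deviation, small p-value
```

A z-score near zero (large p-value) means the real network's assortativity or transitivity is typical of the degree-preserving null model.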
%matplotlib inline
import matplotlib.pyplot as plt
plt.subplots(figsize=(14,6))
# Use the histogram function to plot the distribution of assortativity coefficients
plt.subplot(1, 2, 1)
plt.hist(migration_mod_assortativity, bins=30)
plt.axvline(migration_assortativity, lw=2, color="red")
plt.title('Assortativity in the migration network')
# Use the histogram function to plot the distribution of transitivity coefficients
plt.subplot(1, 2, 2)
plt.hist(migration_mod_transitivity,bins=30)
plt.axvline(migration_transitivity,lw=2,color='red')
plt.title('Transitivity in the migration network')
plt.show()
Conclusion
Given the high p-value, we fail to reject the null hypothesis: the observed assortativity and transitivity of the migration network can be accounted for by the degree sequence alone.
Migration networks rely heavily on connectivity among nodes attached to the same node, and on assortative mixing, the preference for a network's nodes to attach to others that are similar in some way. Examining these properties among the top influential countries for emigration and immigration provides further clarity on our inferences.
For the purposes of investigation, we consider one country identified as an emigrant country (Mexico) and validate the selected features against all of the immigrant countries. In a similar manner, we consider one immigrant country (the United States) and validate the features against all of the emigrant countries. To plot these inferences, a data visualization tool such as Power BI was used to show the relationship between these factors and to study the impacts on migration.
Inference 1: Factors comparing with countries with higher out degree and higher in degree
Immigrant countries Vs Mexico
Emigrant countries Vs USA
We can see from the selected feature that the countries with higher average weighted in-degree have smaller female populations between the ages of 25-29.
As cited in Gender and Migration: An Integrative Approach, by Nana Oishi (https://ccis.ucsd.edu/_files/wp49.pdf):
"The patterns of international female migration can be explained by three levels of analyses from the “sending side”: (1) the state; (2) individuals; and (3) society. At the state level, emigration policies treat men and women differently. Because women are not a value-neutral workforce but the symbols of national dignity and pride, the government tends to have protective and restrictive emigration policies for women. Emigration policies for women tend to be value-driven rather than those for men which are economically driven."
Immigrant countries Vs Mexico
Emigrant countries Vs USA
As seen in the visuals above, the immigrant countries indeed do much better at providing domestic credit. On further study of this issue, as cited in:
Luu, Trang Heidi, "International Migration and FDI: Can Migrant Networks Foster Investments toward Origin Countries?" (2019). Honors Projects. 141. https://digitalcommons.iwu.edu/econ_honproj/141
"When correcting for nonstationarity, US FDI toward the foreign country is found to
be negatively impacted by both the acceleration in the country’s GDP per capita and the
geographical distance between the two. The foreign country’s domestic credit to the private
sector, on the other hand, seems to be significant in attracting FDI in this model."
Immigrant countries Vs Mexico
Emigrant countries Vs USA
As seen in the visuals above, the immigrant countries have a higher age dependency ratio.
As cited in https://wol.iza.org/uploads/articles/99/pdfs/impact-of-aging-on-scale-of-migration.pdf:
"The majority of international migrants move from less- to more-developed countries, with
the largest shares of migrants being working-age individuals. However, the proportion of
elderly migrants is also non-negligible "
This accounts for the indirect interactions between countries and the population of the destination country. Hence Poland plays a critical role in migrations across the globe, and the factors associated with Poland were investigated.
dfGraph = pd.read_csv (r'project_migrationgexftableweight.csv')
dfGraph=dfGraph.sort_values(by='Weighted Degree', ascending=False)
dfGraphclose=dfGraph.nlargest(10, 'closnesscentrality')
dfGraphclose
dfGraphbet=dfGraph.nlargest(10, 'betweenesscentrality')
dfGraphbet
Based on the above methodology, we were able to build relevant network diagrams after cleaning up the data. We observed that significant countries can be identified from the network, where the nodes are countries and the edges are the migration links between them.
The networks generated in this process only used data from 2000 to build the nodes and edges. This can be further expanded to additional timelines to obtain more insightful inferences.